Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work. It is fundamental to the application of machine learning, and is both difficult and expensive. The need for manual feature engineering can be obviated by automated feature learning.
Feature engineering is an informal topic, but it is considered essential in applied machine learning.
Coming up with features is difficult, time-consuming, requires expert knowledge. "Applied machine learning" is basically feature engineering. -- Andrew Ng
When working on a machine learning problem, feature engineering is manually designing what the input x's should be. -- Shayne Miel
A feature is a piece of information that might be useful for prediction. Any attribute could be a feature, as long as it is useful to the model.
The purpose of a feature, other than being an attribute, is much easier to understand in the context of a problem: a feature is a characteristic that might help when solving that problem.
Depending on the problem, a feature can be strongly relevant (it carries information that exists in no other feature), relevant, weakly relevant (it carries some information that other features also include), or irrelevant. It is important to create many features; even if some of them turn out to be irrelevant, you cannot afford to miss the relevant ones. Afterwards, feature selection can be used to prevent overfitting.
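To make the relevance categories concrete, here is a minimal sketch that scores features with mutual information using scikit-learn; the three synthetic features (a signal, a correlated near-copy, and pure noise) are made-up assumptions for illustration:

    # Score feature relevance with mutual information on synthetic data.
    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    rng = np.random.default_rng(0)
    n = 1000
    signal = rng.normal(size=n)                         # drives the label
    redundant = signal + rng.normal(scale=0.1, size=n)  # near-copy of signal
    noise = rng.normal(size=n)                          # carries no label information

    X = np.column_stack([signal, redundant, noise])
    y = (signal > 0).astype(int)

    scores = mutual_info_classif(X, y, random_state=0)
    for name, score in zip(["signal", "redundant", "noise"], scores):
        print(f"{name}: {score:.3f}")

Mutual information flags both signal and redundant as informative and noise as irrelevant, but by itself it cannot reveal that redundant adds little beyond signal; relevance and redundancy have to be assessed separately.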
Feature explosion occurs when feature combination or feature templates cause a rapid growth in the total number of features.
There are several techniques for containing feature explosion, such as regularization, kernel methods, and feature selection, as sketched below.
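As a rough sketch of both the problem and one remedy (assuming scikit-learn and synthetic data; the dimensions and the C value are arbitrary), degree-3 feature combination inflates 10 raw features into 285, and an L1 penalty then zeroes most of them out:

    # Feature explosion via polynomial feature combination, contained by L1.
    import numpy as np
    from sklearn.preprocessing import PolynomialFeatures
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 10))           # 10 raw features
    y = (X[:, 0] * X[:, 1] > 0).astype(int)  # label depends on one interaction

    # All combinations up to degree 3 grow 10 features into 285.
    X_poly = PolynomialFeatures(degree=3, include_bias=False).fit_transform(X)
    print(X_poly.shape)                      # (200, 285)

    # L1-regularized logistic regression keeps only a sparse subset.
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5).fit(X_poly, y)
    print((clf.coef_ != 0).sum(), "of", X_poly.shape[1], "coefficients are nonzero")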
In machine learning, feature learning or representation learning is a set of techniques that learn a feature: a transformation of raw data input to a representation that can be effectively exploited in machine learning tasks. This obviates manual feature engineering, which is otherwise necessary, and allows a machine to both learn the features and use them to perform a specific task: to learn how to learn.
Feature learning is motivated by the fact that machine learning tasks such as classification often require input that is mathematically and computationally convenient to process. However, real-world data such as images, video, and sensor measurements are usually complex, redundant, and highly variable. Thus, it is necessary to discover useful features or representations from raw data. Traditional hand-crafted features require expensive human labor and often rely on expert knowledge, and they normally do not generalize well. This motivates the design of efficient feature learning techniques that automate and generalize the process.
Feature learning can be divided into two categories: supervised and unsupervised feature learning, analogous to these categories in machine learning generally.
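For instance, principal component analysis is one of the simplest unsupervised feature learners: it derives a representation from the raw inputs without ever looking at the labels. Here is a minimal sketch using scikit-learn's built-in digits dataset (the dataset and the choice of 16 components are arbitrary assumptions):

    # Unsupervised feature learning: PCA learns 16 features from raw pixels.
    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA

    X, y = load_digits(return_X_y=True)  # 1797 samples, 64 raw pixel features
    pca = PCA(n_components=16).fit(X)    # fit uses X only, never y
    X_learned = pca.transform(X)         # learned representation for downstream tasks
    print(X_learned.shape)               # (1797, 16)
    print(f"variance explained: {pca.explained_variance_ratio_.sum():.2f}")

Supervised feature learning, by contrast, uses the labels while learning the representation, as the hidden layers of a neural network classifier do.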
In machine learning and statistics, feature selection, also known as variable selection, attribute selection or variable subset selection, is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for three reasons: to simplify models so that they are easier to interpret, to shorten training times, and to improve generalization by reducing overfitting.
The central premise when using a feature selection technique is that the data contains many features that are either redundant or irrelevant, and can thus be removed without incurring much loss of information. Redundancy and irrelevance are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated.
Feature selection techniques should be distinguished from feature extraction. Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points). Archetypal cases for the application of feature selection include the analysis of written texts and DNA microarray data, where there are many thousands of features and a few tens to hundreds of samples.
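To make the distinction concrete, here is a minimal sketch (assuming scikit-learn; the dataset and the choice of five features are arbitrary) in which selection returns five of the original columns, while extraction builds five new features as functions of all thirty:

    # Feature selection (subset of columns) vs. feature extraction (new features).
    from sklearn.datasets import load_breast_cancer
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.decomposition import PCA

    X, y = load_breast_cancer(return_X_y=True)  # 569 samples, 30 features

    selected = SelectKBest(f_classif, k=5).fit_transform(X, y)  # 5 original columns
    extracted = PCA(n_components=5).fit_transform(X)            # 5 derived features

    print(selected.shape, extracted.shape)      # (569, 5) (569, 5)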